Topic Modeling of CORD-19 Dataset

Introduction

In this project, we investigate several methods of topic modeling based on sparse matrix factorization.

Topic modeling is a common task in natural language processing (NLP), in which we model the range of topics in a corpus and represent each document as a distribution over topics \cite{jelodar2019latent}. There are a variety of methods for doing this, the most famous of which is Latent Dirichlet Allocation (LDA), a generative probabilistic model \cite{blei2003latent}. LDA posits a generative process in which each document is created from a set of latent topics, each with its own word distribution. To generate a document, we first draw its topic proportions from a Dirichlet distribution; then, for each word, we select a topic according to those proportions and draw a word from that topic's word distribution. Given a corpus, we can invert this generative process using statistical inference to estimate the underlying latent topic distributions.

Matrix Factorization

However, we can also model this problem with nonnegative matrix factorization (NMF) \cite{lee1999learning}. In this case, we take a sparse matrix, such as a document-term matrix, and compute a low-rank approximation of the form

$$ \mathbf{X} \approx \mathbf{W}\mathbf{H} $$

where

$$ \mathbf{X} \in \mathbb{R}^{p \times n}, \quad \mathbf{W} \in \mathbb{R}^{p \times r}, \quad \mathbf{H} \in \mathbb{R}^{r \times n} $$

and hopefully $r \ll p$. In our case, this would mean we end up with a word-topic matrix and a topic-document matrix. We can then use these factor matrices either to generate new documents with various topic distributions or to quantify the topic distribution of a new document! Exciting stuff!
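As a small illustration of the factorization above, here is a toy sketch using scikit-learn's `NMF` on a made-up document-term count matrix (the matrix values and the choice of $r = 2$ are just for demonstration; the real corpus comes later):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term count matrix: 4 documents x 6 terms
X = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 4, 1, 0, 0, 0],
    [0, 0, 3, 2, 0, 1],
    [0, 1, 2, 4, 0, 2],
], dtype=float)

# Low-rank approximation X ~= W @ H with r = 2 latent "topics"
model = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)   # document-topic weights (4 x 2)
H = model.components_        # topic-term weights (2 x 6)

print(W.shape, H.shape)
```

Both factors are constrained to be nonnegative, which is what makes the rows of `H` interpretable as (unnormalized) word distributions per topic.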

Hierarchical Poisson Factorization (HPF)

In addition to NMF, we'll also implement a more novel approach known as hierarchical Poisson factorization (HPF) \cite{gopalan2015scalable}. HPF was developed as an improvement for the recommendation problem recently popularized by the Netflix Prize. The algorithm has its roots in NMF but makes some significant modifications. In the recommendation setting, HPF takes in a user behavior matrix, with users as rows and consumed items as columns. For implicit recommendation this is a binary matrix (1 = consumed, 0 = not consumed); however, we can use ratings instead.

The data are modeled as factorized Poisson distributions, where each user has $K$ latent preferences and each item has $K$ latent attributes. The observations (entries in the matrix) are modeled as Poisson distributions, each parameterized by the inner product of the corresponding user and item latent vectors. Given this matrix, the basic generative model is as follows:

  1. For each user u:
    1. For each latent variable k:
      • sample preference $\theta_{uk} \sim Gamma$
  2. For each item i:
    1. For each latent variable k:
      • sample attribute $\beta_{ik} \sim Gamma$
  3. For each u and i, sample rating/consumption:
    • $y_{ui} \sim Poisson$

Each of the Gamma distributions above is itself further parameterized by a Gamma distribution. For simplicity, we've omitted the parameters above, but the plate notation diagram is below:
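The generative steps above can be sketched directly in NumPy. Here the Gamma hyperparameters `a` and `b` are fixed to assumed values rather than drawn from the higher-level Gamma priors the paper uses:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, K = 5, 8, 3

# Assumed fixed shape/rate hyperparameters (the full model places
# further Gamma priors on the rates; we omit that level here)
a, b = 0.3, 0.3

# 1. Sample each user's K latent preferences
theta = rng.gamma(shape=a, scale=1.0 / b, size=(n_users, K))
# 2. Sample each item's K latent attributes
beta = rng.gamma(shape=a, scale=1.0 / b, size=(n_items, K))
# 3. Sample each rating/consumption count from a Poisson whose rate
#    is the inner product of the user and item latent vectors
Y = rng.poisson(theta @ beta.T)   # (n_users x n_items) count matrix

print(Y.shape)
```

Note that the observations come out as nonnegative integer counts, which is exactly the form a bag-of-words matrix takes.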

[Figure: HPF plate notation diagram]

There are a number of similarities between our text corpus and the user behavior matrix:

| HPF Attribute | User Behavior | Corpus Topic Modeling |
| --- | --- | --- |
| Additional priors on latent-variable Gammas | Captures the diversity of users, some tending to consume more than others, and of items, some being more popular than others | Some documents contain many topics while others are about a single topic; some topics are more prevalent than others |
| Estimation of posterior expectation | Each user's preferences and each item's attributes | Each document's topics and each word's semantics |
| Models long-tailed behavior | While most users consume a handful of items, a few "tail users" consume thousands of items | While most documents consist of only a few topics, some documents may be review articles that touch on many topics |
| Downweights the effect of 0s | A user who consumes an item must be interested in it, so 1s are more informative than 0s | Most words are completely unrelated to a given document, while a few are very relevant (within the corresponding topic); we should focus on these occurrences rather than the absence of most words |

Regarding the last row: standard randomized MF uses Gaussian likelihoods, which weight observed and unobserved entries equally; in a sparse matrix, this means most factors rely on the "unconsumed" user/item pairs. HPF corrects this problem.


Now that we've provided some background and (hopefully) effectively motivated the use of HPF in Topic Modeling, let's get started!

We will be using the CORD-19 dataset from Semantic Scholar. The dataset consists of several thousand articles related to COVID-19 from peer-reviewed publications and preprint servers like bioRxiv and medRxiv. Some work in the topic modeling space has already been done on this dataset, with the goal of identifying novel research gaps, but we'll see how well HPF works \cite{doanvo2020machine}.

Create a Text Corpus

First, we'll create a dictionary from the titles and abstracts of all the CORD-19 articles by reading in the metadata.csv file. We originally had planned to stream over the full text paper directory but decided against it, due to size and runtime constraints.

Read in Data

Create a Dictionary

Originally, we were going to use the full text of the articles, but decided to narrow our scope to titles and abstracts only. Previous NLP studies often take this approach, since the majority of an article's important information, including hypotheses, methods, and conclusions, is found in the title and abstract.

The code to stream over the data is kept below as an example, however.

We will use the dictionary to create a corpus.

Our corpus is represented as a Bag-of-Words (BoW) model. In Python, this is a list of documents, each a list of (token_id, token_count) tuples. However, we can think of this as a very sparse document-term matrix: each row is a document (on the order of 200 tokens) and the columns are the full dictionary of tokens.

TF-IDF Transform

We tried this, but it broke our factorization since we no longer had counts for the Poisson likelihood to model, so we undid it!

Topic Modeling Time!

Let's start with NMF and then compare it with HPF. To compare, we'll use LDAvis \cite{sievert2014ldavis}, a visualization tool that condenses the significant amount of information in a topic model into a digestible, interactive graphic.

Nonnegative Matrix Factorization

We'll also use the gensim package to train the NMF model.

Fit NMF

Row Normalize $\theta$ vectors

Normalizing our vectors gives us probabilities for each factor.
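The normalization itself is a one-liner in NumPy; here `theta` is a small assumed (documents x topics) nonnegative factor matrix:

```python
import numpy as np

# Assumed: theta is a (documents x topics) nonnegative factor matrix
theta = np.array([
    [2.0, 1.0, 1.0],
    [0.0, 3.0, 1.0],
])

row_sums = theta.sum(axis=1, keepdims=True)
# Guard against all-zero rows before dividing
theta_norm = theta / np.where(row_sums == 0, 1.0, row_sums)

print(theta_norm)
```

After this, each nonzero row sums to 1 and can be read as a probability distribution over topics.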

Visualize

We visualize a few random documents to see what the distributions over topics look like. Interestingly, some do have quite a few topics, while others do not!

Cluster by Most Significant Factor

Hierarchical Poisson Factorization (HPF)

Fit HPF

We'll first need to convert our corpus into a form that HPF accepts. We choose the scipy.sparse COO matrix format rather than a large pandas DataFrame.
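A sketch of the conversion, assuming a gensim-style BoW corpus (toy values shown; `n_terms` would be the real vocabulary size):

```python
from scipy.sparse import coo_matrix

# Assumed: bow_corpus is a gensim-style list of (token_id, count) lists
bow_corpus = [
    [(0, 2), (3, 1)],
    [(1, 1), (3, 4)],
]
n_terms = 5  # vocabulary size

# Flatten the nested BoW structure into COO triplets
rows, cols, vals = [], [], []
for doc_id, doc in enumerate(bow_corpus):
    for token_id, count in doc:
        rows.append(doc_id)
        cols.append(token_id)
        vals.append(count)

X = coo_matrix((vals, (rows, cols)), shape=(len(bow_corpus), n_terms))
print(X.shape, X.nnz)
```

The COO format stores only the nonzero entries, which keeps the document-term matrix cheap in memory despite its enormous nominal size.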

Generate Random Document Text

Output the top 100 terms for two random documents to see what they look like.

Let's compare it to the original documents to see if they make some kind of sense.

Row Normalize $\theta$ vectors

Normalizing our vectors gives us probabilities for each factor.

Visualize

We visualize a few random documents to see what the distributions over topics look like. Interestingly, some do have quite a few topics, while others do not!

Cluster by Most Significant Factor

References

(Jelodar, Wang et al., 2019) Jelodar Hamed, Wang Yongli, Yuan Chi et al., ``Latent Dirichlet Allocation (LDA) and Topic modeling: models, applications, a survey'', Multimedia Tools and Applications, vol. 78, number 11, pp. 15169--15211, 2019.

(Blei, Ng et al., 2003) Blei David M, Ng Andrew Y and Jordan Michael I, ``Latent dirichlet allocation'', Journal of machine Learning research, vol. 3, number Jan, pp. 993--1022, 2003.

(Lee and Seung, 1999) Lee Daniel D and Seung H Sebastian, ``Learning the parts of objects by non-negative matrix factorization'', Nature, vol. 401, number 6755, pp. 788--791, 1999.

(Gopalan, Hofman et al., 2015) P. Gopalan, J.M. Hofman and D.M. Blei, ``Scalable Recommendation with Hierarchical Poisson Factorization.'', UAI, 2015.

(Doanvo, Qian et al., 2020) Doanvo Anhvinh L, Qian Xiaolu, Ramjee Divya et al., ``Machine Learning Maps Research Needs in COVID-19 Literature'', bioRxiv, 2020.

(Sievert and Shirley, 2014) C. Sievert and K. Shirley, ``LDAvis: A method for visualizing and interpreting topics'', Proceedings of the workshop on interactive language learning, visualization, and interfaces, 2014.